{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Designing a custom layer with ``gluon`` \n",
    "\n",
    "Now that we've peeled back some of the syntactic sugar conferred by ``nn.Sequential()`` and given you a feeling for how ``gluon`` works under the hood, you might feel more comfortable when writing your high-level code. But the real reason to get to know ``gluon`` more intimately is so that we can mess around with it and write our own Blocks. \n",
    "\n",
    "Up until now, we've presented two versions of each tutorial. One from scratch and one in ``gluon``. Empowered with such independence, you might be wondering, \"if I wanted to write my own layer, why wouldn't I just do it from scratch?\" \n",
    "\n",
    "In reality, writing every model completely from scratch can be cumbersome.  Just like there's only so many times a developer can code up a blog from scratch without hating life, there's only so many times that you'll want to write out a convolutional layer, or define the stochastic gradient descent updates. Even in a pure research environment, we usually want to customize one part of the model. For example, we might want to implement a new layer, but still rely on other common layers, loss functions, optimizers, etc. In some cases it might be nontrivial to compute the gradient efficiently and the automatic differentiation subsystem might need some help: When was the last time you performed backprop through a log-determinant, a Cholesky factorization, or a matrix exponential? In other cases things might not be numerically very stable when calculated straightforwardly (e.g. taking logs of exponentials of some arguments). \n",
    "\n",
    "By hacking ``gluon``, we can get the desired flexibility in one part of our model, without screwing up everything else that makes our life easy. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from __future__ import print_function\n",
    "import mxnet as mx\n",
    "import numpy as np\n",
    "from mxnet import nd, autograd, gluon\n",
    "from mxnet.gluon import nn, Block\n",
    "mx.random.seed(1)\n",
    "\n",
    "###########################\n",
    "#  Speficy the context we'll be using\n",
    "###########################\n",
    "ctx = mx.gpu() if mx.test_utils.list_gpus() else mx.cpu()\n",
    "\n",
    "###########################\n",
    "#  Load up our dataset\n",
    "###########################\n",
    "batch_size = 64\n",
    "def transform(data, label):\n",
    "    return data.astype(np.float32)/255, label.astype(np.float32)\n",
    "train_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=True, transform=transform),\n",
    "                                      batch_size, shuffle=True)\n",
    "test_data = mx.gluon.data.DataLoader(mx.gluon.data.vision.MNIST(train=False, transform=transform),\n",
    "                                     batch_size, shuffle=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining a (toy) custom layer\n",
    "\n",
    "To start, let's pretend that we want to use ``gluon`` for its optimizer, serialization, etc, but that we need a new layer. Specifically, we want a layer that centers its input about 0 by subtracting its mean. We'll go ahead and define the simplest possible ``Block``. Remember from the last tutorial that in ``gluon`` a layer is called a ``Block`` (after all, we might compose multiple blocks into a larger block, etc.)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class CenteredLayer(Block):\n",
    "    def __init__(self, **kwargs):\n",
    "        super(CenteredLayer, self).__init__(**kwargs)\n",
    "        \n",
    "    def forward(self, x):\n",
    "        return x - nd.mean(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's it. We can just instantiate this block and make a forward pass. \n",
    "Note that this layer doesn't actually care \n",
    "what its input or output dimensions are. \n",
    "So we can just feed in an arbitrary array \n",
    "and should expect appropriately transformed output. Whenever we are happy with whatever the automatic differentiation generates, this is all we need."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "net = CenteredLayer()\n",
    "net(nd.array([1,2,3,4,5]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also incorporate this layer into a more complicated network, such as by using ``nn.Sequential()``."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "net2 = nn.Sequential()\n",
    "net2.add(nn.Dense(128))\n",
    "net2.add(nn.Dense(10))\n",
    "net2.add(CenteredLayer())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This network contains Blocks (Dense) that contain parameters and thus require initialization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "net2.collect_params().initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can pass some data through it, say the first image from our MNIST dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for data, _ in train_data:\n",
    "    data = data.as_in_context(ctx)\n",
    "    break\n",
    "output = net2(data[0:1])\n",
    "print(output)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And we can verify that as expected, the resulting vector has mean 0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nd.mean(output)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There's a good chance you'll see something other than 0. When I ran this code, I got ``2.68220894e-08``. \n",
    "That's roughly ``.000000027``. This is due to the fact that MXNet often uses low precision arithmetics. \n",
    "For deep learning research, this is often a compromise that we make.\n",
    "In exchange for giving up a few significant digits, we get tremendous speedups on modern hardware.\n",
    "And it turns out that most deep learning algorithms don't suffer too much from the loss of precision."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Custom layers with parameters\n",
    "\n",
    "While ``CenteredLayer`` should give you some sense of how to implement a custom layer, it's missing a few important pieces. Most importantly, ``CenteredLayer`` doesn't care about the dimensions of its input or output, and it doesn't contain any trainable parameters. Since you already know how to implement a fully-connected layer from scratch, let's learn how to make parametric ``Block`` by implementing MyDense, our own version of a fully-connected (Dense) layer."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parameters\n",
    "\n",
    "Before we can add parameters to our custom ``Block``, we should get to know how ``gluon`` deals with parameters generally. Instead of working with NDArrays directly, each ``Block`` is associated with some number (as few as zero) of ``Parameter`` (groups). \n",
    "\n",
    "At a high level, you can think of a ``Parameter`` as a wrapper on an ``NDArray``. However, the ``Parameter`` can be instantiated before the corresponding NDArray is. For example, when we instantiate a ``Block`` but the shapes of each parameter still need to be inferred, the Parameter will wait for the shape to be inferred before allocating memory. \n",
    "\n",
    "To get a hands-on feel for mxnet.Parameter, let's just instantiate one outside of a ``Block``:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_param = gluon.Parameter(\"exciting_parameter_yay\", grad_req='write', shape=(5,5))\n",
    "print(my_param)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we've instantiated a parameter, giving it the name \"exciting_parameter_yay\". We've also specified that we'll want to capture gradients for this Parameter. Under the hood, that lets ``gluon`` know that it has to call ``.attach_grad()`` on the underlying NDArray. We also specified the shape. Now that we have a Parameter, we can initialize its values via ``.initialize()`` and extract its data by calling ``.data()``."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_param.initialize(mx.init.Xavier(magnitude=2.24), ctx=ctx)\n",
    "print(my_param.data())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For data parallelism, a Parameter can also be initialized on multiple contexts. The Parameter will then keep a copy of its value on each context. Keep in mind that you need to maintain consistency among the copies when updating the Parameter (usually `gluon.Trainer` does this for you).\n",
    "\n",
    "Note that you need at least two GPUs to run this section."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if len(mx.test_utils.list_gpus()) >= 2:\n",
    "    my_param = gluon.Parameter(\"exciting_parameter_yay\", grad_req='write', shape=(5,5))\n",
    "    my_param.initialize(mx.init.Xavier(magnitude=2.24), ctx=[mx.gpu(0), mx.gpu(1)])\n",
    "    print(my_param.data(mx.gpu(0)), my_param.data(mx.gpu(1)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parameter dictionaries (introducing ``ParameterDict``)\n",
    "\n",
    "Rather than directly store references to each of its ``Parameters``, ``Block``s typicaly contain a parameter dictionary (``ParameterDict``). In practice, we'll rarely instantiate our own ``ParameterDict``. That's because whenever we call the ``Block`` constructor it's generated automatically. For pedagogical purposes, we'll do it from scratch this one time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd = gluon.ParameterDict(prefix=\"block1_\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "MXNet's ``ParameterDict`` does a few cool things for us. First, we can instantiate a new Parameter by calling ``pd.get()``"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.get(\"exciting_parameter_yay\", grad_req='write', shape=(5,5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the new parameter is (i) contained in the ParameterDict and (ii) appends the prefix to its name. This naming convention helps us to know which parameters belong to which ``Block`` or sub-``Block``. It's especially useful when we want to write parameters to disc (i.e. serialize), or read them from disc (i.e. deserialize).\n",
    "\n",
    "Like a regular Python dictionary, we can get the names of all parameters with ``.keys()`` and can access parameters with:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd[\"block1_exciting_parameter_yay\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Craft a bespoke fully-connected ``gluon`` layer\n",
    "\n",
    "Now that we know how parameters work, we're ready to create our very own fully-connected layer. We'll use the familiar relu activation from previous tutorials."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def relu(X):\n",
    "    return nd.maximum(X, 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can define our ``Block``. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class MyDense(Block):\n",
    "    ####################\n",
    "    # We add arguments to our constructor (__init__)\n",
    "    # to indicate the number of input units (``in_units``) \n",
    "    # and output units (``units``)\n",
    "    ####################\n",
    "    def __init__(self, units, in_units=0, **kwargs):\n",
    "        super(MyDense, self).__init__(**kwargs)\n",
    "        with self.name_scope():\n",
    "            self.units = units\n",
    "            self._in_units = in_units\n",
    "            #################\n",
    "            # We add the required parameters to the ``Block``'s ParameterDict , \n",
    "            # indicating the desired shape\n",
    "            #################\n",
    "            self.weight = self.params.get(\n",
    "                'weight', init=mx.init.Xavier(magnitude=2.24), \n",
    "                shape=(in_units, units))\n",
    "            self.bias = self.params.get('bias', shape=(units,))        \n",
    "\n",
    "    #################\n",
    "    #  Now we just have to write the forward pass. \n",
    "    #  We could rely upong the FullyConnected primitive in NDArray, \n",
    "    #  but it's better to get our hands dirty and write it out\n",
    "    #  so you'll know how to compose arbitrary functions\n",
    "    #################\n",
    "    def forward(self, x):\n",
    "        with x.context:\n",
    "            linear = nd.dot(x, self.weight.data()) + self.bias.data()\n",
    "            activation = relu(linear)\n",
    "            return activation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Recall that every Block can be run just as if it were an entire network. \n",
    "In fact, linear models are nothing more than neural networks \n",
    "consisting of a single layer as a network.\n",
    "    \n",
    "So let's go ahead and run some data through our bespoke layer.\n",
    "We'll want to first instantiate the layer and initialize its parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dense = MyDense(20, in_units=10)\n",
    "dense.collect_params().initialize(ctx=ctx)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dense.params"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can run through some dummy data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dense(nd.ones(shape=(2,10)).as_in_context(ctx))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using our layer to build an MLP\n",
    "\n",
    "While it's a good sanity check to run some data though the layer, the real proof that it works will be if we can compose a network entirely out of ``MyDense`` layers and achieve respectable accuracy on a real task. So we'll revisit the MNIST digit classification task, and use the familiar ``nn.Sequential()`` syntax to build our net."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "net = gluon.nn.Sequential()\n",
    "with net.name_scope():\n",
    "    net.add(MyDense(128, in_units=784))\n",
    "    net.add(MyDense(64, in_units=128))\n",
    "    net.add(MyDense(10, in_units=64))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Initialize Parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "net.collect_params().initialize(ctx=ctx)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Instantiate a loss "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "loss = gluon.loss.SoftmaxCrossEntropyLoss()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Optimizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': .1})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation Metric"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "metric = mx.metric.Accuracy()\n",
    "\n",
    "def evaluate_accuracy(data_iterator, net):\n",
    "    numerator = 0.\n",
    "    denominator = 0.\n",
    "    \n",
    "    for i, (data, label) in enumerate(data_iterator):\n",
    "        with autograd.record():\n",
    "            data = data.as_in_context(ctx).reshape((-1,784))\n",
    "            label = label.as_in_context(ctx)\n",
    "            label_one_hot = nd.one_hot(label, 10)\n",
    "            output = net(data)\n",
    "        \n",
    "        metric.update([label], [output])\n",
    "    return metric.get()[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training loop"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "epochs = 2  # Low number for testing, set higher when you run!\n",
    "moving_loss = 0.\n",
    "\n",
    "for e in range(epochs):\n",
    "    for i, (data, label) in enumerate(train_data):\n",
    "        data = data.as_in_context(ctx).reshape((-1,784))\n",
    "        label = label.as_in_context(ctx)\n",
    "        with autograd.record():\n",
    "            output = net(data)\n",
    "            cross_entropy = loss(output, label)\n",
    "            cross_entropy.backward()\n",
    "        trainer.step(data.shape[0])\n",
    "            \n",
    "    test_accuracy = evaluate_accuracy(test_data, net)\n",
    "    train_accuracy = evaluate_accuracy(train_data, net)\n",
    "    print(\"Epoch %s. Train_acc %s, Test_acc %s\" % (e, train_accuracy, test_accuracy))    \n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "It works! There's a lot of other cool things you can do. In more advanced chapters, we'll show how you can make a layer that takes in multiple inputs, or one that cleverly calls down to MXNet's symbolic API to squeeze out extra performance without screwing up your convenient imperative workflow. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Next\n",
    "[Serialization: saving your models and parameters for later re-use](../chapter03_deep-neural-networks/serialization.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "For whinges or inquiries, [open an issue on  GitHub.](https://github.com/zackchase/mxnet-the-straight-dope)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
